
    Text segmentation on multilabel documents: A distant-supervised approach

    Segmenting text into semantically coherent segments is an important task with applications in information retrieval and text summarization. Developing accurate topical segmentation requires training data with ground-truth information at the segment level. However, generating such labeled datasets, especially for applications in which the meaning of the labels is user-defined, is expensive and time-consuming. In this paper, we develop an approach that, instead of using segment-level ground-truth information, uses the set of labels associated with a document, which are easier to obtain since the training data then essentially corresponds to a multilabel dataset. Our method, which can be thought of as an instance of distant supervision, improves upon previous approaches by exploiting the fact that consecutive sentences in a document tend to talk about the same topic and, hence, probably belong to the same class. Experiments on the text segmentation task on a variety of datasets show that the segmentation produced by our method beats the competing approaches on four out of five datasets and performs on par on the fifth. On the multilabel text classification task, our method performs on par with the competing approaches while requiring significantly less estimation time.
    Comment: Accepted at the 2018 IEEE International Conference on Data Mining (ICDM).
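    The abstract above describes the idea only at a high level, so here is a minimal, hypothetical sketch of the general technique (not the paper's exact method): every sentence is weakly labeled with its document's label set, a multilabel sentence classifier is trained on those weak labels, and predictions are smoothed over consecutive sentences before cutting a segment wherever the dominant label changes. The toy corpus, the smoothing window, and the scikit-learn pipeline are all illustrative assumptions.

```python
# Sketch: distant-supervised segmentation via document-level labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training corpus: documents with document-level label sets only.
docs = [
    (["the team won the match", "fans cheered in the stadium"], {"sports"}),
    (["the senate passed the bill", "the vote was close"], {"politics"}),
    (["the striker scored twice", "parliament debated the budget"], {"sports", "politics"}),
]

# (i) Propagate each document's labels down to its sentences (distant supervision).
sentences, labels = [], []
for sents, doc_labels in docs:
    for s in sents:
        sentences.append(s)
        labels.append(doc_labels)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)
vec = TfidfVectorizer()
X = vec.fit_transform(sentences)

# (ii) Multilabel sentence classifier trained on the weak sentence labels.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

def segment(document_sentences, window=2):
    """Score each sentence, smooth over neighbors, cut where the top label flips."""
    probs = clf.predict_proba(vec.transform(document_sentences))
    # (iii) Average each sentence's scores with its neighbors: consecutive
    # sentences tend to discuss the same topic, so smoothing reduces noise.
    smoothed = np.array([
        probs[max(0, i - window): i + window + 1].mean(axis=0)
        for i in range(len(document_sentences))
    ])
    tags = [mlb.classes_[j] for j in smoothed.argmax(axis=1)]
    boundaries = [i for i in range(1, len(tags)) if tags[i] != tags[i - 1]]
    return tags, boundaries

tags, cuts = segment(["the goalkeeper made a save", "the minister resigned today"])
print(tags, cuts)
```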

    Distant-supervised algorithms with applications to text mining, product search, and scholarly networks

    University of Minnesota Ph.D. dissertation. November 2020. Major: Computer Science. Advisor: George Karypis. 1 computer file (PDF); 156 pages.
    In recent times, data has become the lifeblood of nearly all businesses. As such, the real-world impact of data-driven machine learning has grown in leaps and bounds. It has established itself as a standard tool for organizations to draw insights from data at scale and, hence, to enhance their profits. However, one of the key bottlenecks in deploying machine learning models in practice is the unavailability of labeled training data. Manually labeled training sets are expensive and tedious to create. Moreover, they cannot practically be reused for new objectives if the underlying data distribution changes over time. Distant supervision provides an alternative to expensive hand-labeled datasets by leveraging weaker sources of supervision. In this thesis, we identify and provide solutions to several challenges that can benefit from distant-supervised approaches. First, we present a distant-supervised approach to accurately and efficiently estimate a vector representation for each sense of multi-sense words. Second, we present approaches for distant-supervised text segmentation and annotation, which is the task of associating individual parts of a multilabel document with their most appropriate class labels. Third, we present approaches for query understanding in product search. Specifically, we developed distant-supervised solutions to three challenges in query understanding: (i) when multiple terms are present in a query, determining the relevant terms that are representative of the query's product intent; (ii) bridging the vocabulary gap between the terms in the query and the product's description; and (iii) annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). Fourth, we present approaches to estimate content-aware bibliometrics that accurately quantify the scholarly impact of a publication. Our proposed metric assigns content-aware weights to the edges of a citation network that quantify the extent to which the cited node informs the citing node. Consequently, this weighted network can be used to derive impact metrics for the various entities involved, such as publications and authors.
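    To make the last contribution more concrete, here is a minimal, hypothetical sketch of the general idea of a content-aware citation metric (not the dissertation's actual metric): each citation edge is weighted by a crude proxy for how much the cited paper informs the citing paper (TF-IDF cosine similarity of abstracts is assumed here purely for illustration), and impact scores are then derived from the weighted network, e.g. via weighted PageRank. The paper ids, abstracts, and citation edges are toy data.

```python
# Sketch: content-aware edge weights on a citation graph, then weighted PageRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: paper id -> abstract, plus (citing, cited) edges.
abstracts = {
    "p1": "distant supervision for multilabel text segmentation",
    "p2": "attention networks for credit attribution in documents",
    "p3": "query understanding and term weighting in product search",
}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]

ids = list(abstracts)
tfidf = TfidfVectorizer().fit_transform([abstracts[i] for i in ids])
sim = cosine_similarity(tfidf)
index = {pid: k for k, pid in enumerate(ids)}

# Build the citation graph with content-aware edge weights.
g = nx.DiGraph()
for citing, cited in citations:
    w = float(sim[index[citing], index[cited]])
    g.add_edge(citing, cited, weight=max(w, 1e-6))  # keep weights positive

# Impact scores derived from the weighted network (here: weighted PageRank).
scores = nx.pagerank(g, weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```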

    CAWA: An Attention-Network for Credit Attribution

    Credit attribution is the task of associating individual parts of a document with their most appropriate class labels. It is an important task with applications in information retrieval and text summarization. When labeled training data is available, traditional approaches for sequence tagging can be used for credit attribution. However, generating such labeled datasets is expensive and time-consuming. In this paper, we present Credit Attribution With Attention (CAWA), a neural-network-based approach that, instead of using sentence-level labeled data, uses the set of class labels associated with an entire document as a source of distant supervision. CAWA combines an attention mechanism with a multilabel classifier into an end-to-end learning framework to perform credit attribution, and labels the individual sentences of the input document using the resulting attention weights. CAWA improves upon the state-of-the-art credit attribution approach by not constraining a sentence to belong to just one class, but instead modeling each sentence as a distribution over all classes, leading to better modeling of semantically similar classes. Experiments on the credit attribution task on a variety of datasets show that the sentence class labels generated by CAWA outperform those of the competing approaches. Additionally, on the multilabel text classification task, CAWA performs better than the competing credit attribution approaches.
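    Since the abstract only names the ingredients, here is a minimal PyTorch sketch of the general idea behind attention-based credit attribution (not CAWA's exact architecture): per-class attention over sentence vectors produces class-specific document representations for a multilabel classifier trained only on document-level labels, and at inference each sentence is assigned the class that attends to it most strongly. The class/head layout, dimensions, and random toy inputs are illustrative assumptions.

```python
# Sketch: attention over sentences + multilabel document classifier,
# trained with document labels only; sentence labels come from attention.
import torch
import torch.nn as nn

class AttentionCreditAttribution(nn.Module):
    def __init__(self, sent_dim, num_classes):
        super().__init__()
        # One attention scorer per class over the sentence representations.
        self.attn = nn.Linear(sent_dim, num_classes)
        # One binary classifier per class on its attended document vector.
        self.clf = nn.ModuleList([nn.Linear(sent_dim, 1) for _ in range(num_classes)])

    def forward(self, sent_vecs):                        # (num_sents, sent_dim)
        attn = torch.softmax(self.attn(sent_vecs), dim=0)  # normalize per class
        doc_logits = []
        for c, head in enumerate(self.clf):
            # Class-specific document vector = attention-weighted sentence sum.
            doc_vec = (attn[:, c:c + 1] * sent_vecs).sum(dim=0)
            doc_logits.append(head(doc_vec))
        return torch.cat(doc_logits), attn               # (C,), (num_sents, C)

# Toy usage with random "sentence embeddings" and document-level labels only.
torch.manual_seed(0)
model = AttentionCreditAttribution(sent_dim=32, num_classes=3)
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

sent_vecs = torch.randn(5, 32)              # 5 sentences in one document
doc_labels = torch.tensor([1.0, 0.0, 1.0])  # document carries classes 0 and 2

for _ in range(50):                          # weakly supervised training loop
    logits, attn = model(sent_vecs)
    loss = loss_fn(logits, doc_labels)
    optim.zero_grad()
    loss.backward()
    optim.step()

# Credit attribution: each sentence gets the class that attends to it most.
_, attn = model(sent_vecs)
print(attn.argmax(dim=1).tolist())
```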